
\section{Performance}

\label{sec:performance}

We have evaluated the performance of the GPCG implementation
on a variety of architectures.
The data presented in this section 
was generated on the IBM SP
(each processor has 256 MB RAM, 128 KB cache
for data, and a 32 KB cache for instructions)
at Argonne National Laboratory; performance trends
were similar on other machines.

\begin{figure} 
\centerline{\epsfig{figure = pjb.eps, height=2.5in}}
\caption{The journal bearing problem with $ \varepsilon = 0.9 $.\label{pjb}}
\end{figure}

As a benchmark application we have used a journal bearing model,
a variational problem over a two-dimensional region.
This problem arises in the determination of the
pressure distribution in a thin film of lubricant
between two circular cylinders.
The infinite-dimensional version of this problem is
of the form
\[
\min \{ q ( v ) : v \ge 0 , \ v = 0 \mbox{ on } \partial D \} ,
\]
where $ v : \cD \mapsto \R $ is piecewise continuously differentiable,
$ q : H^1 \to \R $ is the quadratic 
\[
q (v) =
 \int_{ \cD } \left \{ \half  w_q (x) \| \grad v (x) \| ^ 2  -
  w_l (x) v (x) \right \} \, d x , 
\]
$ \cD = ( 0 , 2 \pi ) \times ( 0 , 2b ) $ for some constant $ b > 0 $,
and
\[
 w_q ( \xi_1 , \xi_2 ) = ( 1 + \varepsilon \cos \xi_1 ) ^ 3 , \quad
 w_l ( \xi_1 , \xi_2 ) = \varepsilon \sin \xi_1 ,
\]
where $ \varepsilon $ in $ (0,1) $ is the eccentricity parameter.
The eccentricity parameter
influences, in particular, the difficulty of the problem.
Figure \ref{pjb} shows the solution of the journal bearing problem
for $ \varepsilon = 0.9 $. The steep gradient in the solution
makes this problem a difficult benchmark.

Discretization of the journal bearing problem with either finite differences or
finite elements leads to a problem of the form \Ref{def_bqp} with
$ l \equiv 0 $ and $ u \equiv + \infty $.
The number of variables is $ n = n_x n_y $, where
$ n_x $ and $ n_y $ are, respectively, the number of
grid points in each coordinate direction of the domain $ \cD $.
See \cite{more-toraldo} for a description of the
finite element discretization.

We now analyze the performance of GPCG on large problems,
that is, problems that will not fit into the memory of a
single processor. Specifically,
we used a grid with 
$1600$ points in each direction, leading to a problem with
$ n = 2.56 \cdot 10^6 $ variables.

The initial point $ x_0 $ was set to the lower bound $l$.
We used $\eta_1  = 0.1$ in the test \Ref{pgstop2} to terminate the gradient
projection algorithm and $\eta_2=0.05$ in the test
\Ref{cgstop1} to terminate the conjugate gradient algorithm.
We stopped GPCG when the convergence test 
\Ref{bqp_approximate_sol} was satisfied with $ \tau = 10^{-4} $.

Table \ref{flops-2560K} presents performance data for
GPCG.
We show the number
of processors $p$,
the number of GPCG iterates (iters), the number of
conjugate gradient iterations $ n_{GP} $,
the wall clock solution time (in seconds),
and the percentage of time ($t_{CG}$\%) used by the conjugate
gradient algorithm.
The time in the conjugate gradient algorithm
includes the time spent
computing the preconditioner.
Our design allows the use of several preconditioners, but for the
results in this section we used a block Jacobi preconditioner
with one block per processor, where each subproblem was solved
with ILU(2).

Table \ref{flops-2560K} also notes the
efficiency of GPCG as the number of processors 
goes from 8 to 64 processors. We only present the
efficiency relative to $ p = 8 $, that is,
\[
{\cal E}_{8} = \frac {8 \, T_8} {p \, T_p} ,
\]
where $ T_p $ is the computing time for $p$ processors.
This is appropriate since we are interested in problems
that require a large amount of memory, and in any case,
we could not solve problems with $ n = 2.56 \cdot 10^6 $ variables
on less than eight processors.

\begin{table}[htbp]
\caption{Performance of GPCG on the journal bearing problem
with $ n = 2.56 \cdot 10^6 $.}
\label{flops-2560K}
\begin{center}
\footnotesize
\begin{tabular}{| c r | c c r c r |}
\hline
\multicolumn{1}{|c}{$ \varepsilon $} & 
\multicolumn{1}{c|}{$ p $} & 
\multicolumn{1}{c}{iters} &
\multicolumn{1}{c}{$n_{GP}$} & 
\multicolumn{1}{c}{time} &
\multicolumn{1}{c}{$t_{CG}$\%} & 
\multicolumn{1}{c|}{$ {\cal E}_8 $} \\ \hline
0.1  & 8 & 46 & 431 & 7419 & 86 & 100  \\ 
0.1  & 16 & 45 & 423 & 3706 & 83 & 100  \\
0.1  & 32 & 45 & 427 & 2045 & 82 & 91 \\
0.1  & 64 & 45 & 427 & 1279 & 82 & 73 \\
\hline
0.9  & 8 & 37 & 105 & 2134 & 70 & 100 \\
0.9  & 16 & 37 & 103 & 1124 & 71 & 95 \\
0.9  & 32 & 38 & 100 & 618 & 69 & 86 \\
0.9  & 64 & 38 & 99 & 397 & 68 & 67 \\
\hline
% 0.1  & 8 & 47 & 433 & 8393 & 88 & 100  \\
% 0.1  & 16 & 46 & 434 & 4505 & 86 & 93  \\
% 0.1  & 32 & 46 & 434 & 2461 & 85 & 85  \\
% 0.1  & 64 & 46 & 433 & 1519 & 85 & 69  \\
% \hline
% 0.9  & 8 & 36 & 100 & 2309 & 74 & 100  \\
% 0.9  & 16 & 40 & 100 & 1389 & 77 & 83  \\
% 0.9  & 32 & 41 & 100 & 792 & 75 & 73  \\
% 0.9  & 64 & 40 & 102 & 519 & 73 & 56  \\
% \hline
% 0.1 & 8  & 46 & 502 & 3341  & 89  & 100  \\
% 0.1 & 16 & 45 & 486 & 1781  & 89  &  94 \\
% 0.1 & 32 & 46 & 480 &  928  & 89  &  90 \\
% 0.1 & 64 & 46 & 504 &  522  & 89  &  80 \\
%\hline
% 0.9 & 8 & 37 & 102 & 1407  & 94 & 100 \\
% 0.9 & 16 & 40 & 100 & 816  & 94 &  86 \\
% 0.9 & 32 & 41 & 100 & 392  & 93 &  90 \\
% 0.9 & 64 & 40 & 102 & 217  & 94 &  81   \\
\end{tabular}
\end{center}
\end{table}

The results in Table \ref{flops-2560K} are noteworthy is several ways.
First, the number of iterations of GPCG is remarkably small.
This is surprising because the feasible set \Ref{def_bounds}
has $ 3^n $ faces, and the
GPCG visits only one face on each iteration.
Other strategies can lead to a large number of 
iterates, but the GPCG algorithm is remarkably efficient.

Another interesting aspect of the results in Table \ref{flops-2560K}
is that due to the
low memory requirements of iterative solvers, we were able
to solve these problems with only $ p = 8 $ processors.
Strategies that rely on direct solvers are likely to need
significantly more storage, and thus more processors.
Finally, these results show that the GPCG implementation has
excellent efficiency with 
respect to $ p = 8 $ processors,
ranging between $ 67\% $ and $ 100\% $.
This sustained efficiency is remarkable because the 
GPCG algorithm is solving a sequence
of linear problems with a
coefficient matrix set to the submatrix of the Hessian of $q$ with 
respect to the
free variables for the current iterate.
Thus, our implementation's repartitioning of submatrices deals effectively
with the load-balancing problem that is inherent
in the GPCG algorithm.

For these results we have noted that as $ \varepsilon $ increases,
both $t_{CG}\%$ and the overall efficiency decrease.
This observation follows from the empirical result
that the number of free constraints
at the solution is inversely proportional to the eccentricity
parameter $ \varepsilon $.
In particular, roughly $ 68 \% $ of the constraints are free at the
solution when $ \varepsilon = 0.1 $, and $ 54\% $ are free for
$ \varepsilon = 0.9 $.
Since the size of the linear system
that the conjugate gradient algorithm needs to solve decreases
as $ \varepsilon $ increases, the time required
by the conjugate gradient algorithm decreases.
Since the parallel efficiency of smaller problems is less than the
parallel efficiency of larger problems, the overall efficiency of GPCG
decreases as  $ \varepsilon $ increases.

\section{Performance Analysis}

\label{sec:analysis}

GPCG is typical of optimization algorithms that
must deal with constrained problems in the sense that these
algorithms have dynamically changing active sets.
In this section we analyze the performance of GPCG.

% The efficiency of an algorithm with respect to a fixed number
% of processors usually increases as the number of variables increases
% because the amount of computation increases faster than the
% communication costs. 
% Note from Lois:  not really ... how about the following instead:

Table \ref{flops-640K} presents performance results for the journal
bearing problem with dimension 640,000.
In comparing these results with those of the larger problem in Table
\ref{flops-2560K}, note that while the number of variables increases
by a factor of four,
the number of iterations and the number of gradient projection iterates,
increase by about a factor of two.
This seems to be fairly typical of GPCG but may not hold for
other optimization algorithms.
Some algorithms for unconstrained problems exhibit mesh independence in the
sense that the number of iterations is independent of the number of
variables, but this does not generally hold 
(see, for example, \cite{ALD00}) for constrained problems.


\begin{table}[htbp]
\caption{Performance of GPCG on the journal bearing problem
with $ n = 640,000 $.}
\label{flops-640K}
\begin{center}
\footnotesize
\begin{tabular}{| c r | c c r c r |}
\hline
\multicolumn{1}{|c}{$ \varepsilon $} & 
\multicolumn{1}{c|}{$ p $} & 
\multicolumn{1}{c}{iters} &
\multicolumn{1}{c}{$n_{GP}$} & 
\multicolumn{1}{c}{time} &
\multicolumn{1}{c}{$t_{CG}$\%} & 
\multicolumn{1}{c|}{$ {\cal E}_8 $} \\ \hline
0.1  & 8 & 27 & 232 & 639 & 78 & 100 \\
0.1  & 16 & 26 & 231 & 365 & 75 & 88 \\ 
0.1  & 32 & 27 & 230 & 220 & 74 & 73 \\
0.1  & 64 & 27 & 228 & 152 & 75 & 53 \\
\hline
0.9  & 8 & 20 & 52 & 199 & 64 & 100 \\
0.9  & 16 & 21 & 54 & 128 & 64 & 78 \\
0.9  & 32 & 20 & 52 & 74 & 61 & 67 \\
0.9  & 64 & 23 & 54 & 58 & 62 & 43 \\
%  0.1  & 2 & 27 & 227 & 2057 & 79 & 100  \\
%  0.1  & 4 & 26 & 227 & 1173 & 79 & 89 \\
%  0.1  & 8 & 27 & 232 & 639 & 78 & 80 \\
%  0.1  & 16 & 26 & 231 & 365 & 75 & 70 \\ 
%  0.1  & 32 & 27 & 230 & 220 & 74 & 58 \\
%  0.1  & 64 & 27 & 228 & 152 & 75 & 42 \\
%  \hline
%  0.9  & 2 & 21 & 58 & 645 & 65 & 100 \\
%  0.9  & 4 & 20 & 54 & 368 & 63 & 88 \\
%  0.9  & 8 & 20 & 52 & 199 & 64 & 81 \\
%  0.9  & 16 & 21 & 54 & 128 & 64 & 63 \\
%  0.9  & 32 & 20 & 52 & 74 & 61 & 54 \\
%  0.9  & 64 & 23 & 54 & 58 & 62 & 35 \\
\hline

\end{tabular}
\end{center}
\end{table}

When analyzing the parallel performance of an algorithm, we must bear
in mind that a problem can scale well only when the ratio of
computation to communication time is sufficiently large.  Thus, for a
particular problem size, scalability tapers off when more processors
are added than can be used effectively.  For GPCG, this effect can be
seen clearly by comparing the results in Table \ref{flops-640K} with
those in Table \ref{flops-2560K}.

An important aspect of the results in Table \ref{flops-640K} is that 
for this particular problem of dimension 640,000, the
efficiency of GPCG 
% is acceptable for $ p \le 8 $ processors but 
drops rapidly with more processors.
To explain the drop in efficiency, we
list in Table \ref{routines} the percentage of time spent in 
the main operations of GPCG.
Note that some of these operations overlap, so the sum
of the percentages always exceeds $ 100\% $.
In this table {\it Vec Red} refers to
vector reductions, such as dot products and norms,
while {\it Vec Local} refers to vector operations such as
$  y \leftarrow \alpha x + y $.

\begin{table}[ht]
\caption{Scalability of GPCG functions
($ n=640,000 $, $ \varepsilon = 0.1 $) }
\label{routines}
\begin{center}
\footnotesize
\begin{tabular}{|c|ccccc|cc|}
\cline{2-8}
\multicolumn{1}{c|}{} &
\multicolumn{5}{c|}{Percentage of time}&
\multicolumn{2}{c|}{Total MFlops} \\
\hline
\multicolumn{1}{|c|}{Number}&
\multicolumn{1}{c}{Mat-Vec}&
\multicolumn{1}{c}{Vec} &
\multicolumn{1}{c}{Vec} &
\multicolumn{1}{c}{Linear}&
\multicolumn{1}{c}{Extract}&
\multicolumn{1}{|c}{Linear}&
\multicolumn{1}{c|}{TAO} \\

\multicolumn{1}{|c|}{Proc.}&
\multicolumn{1}{c}{Multiply}&
\multicolumn{1}{c}{Local} &
\multicolumn{1}{c}{Red} &
\multicolumn{1}{c}{Solve}&
\multicolumn{1}{c}{Submatrix}&
\multicolumn{1}{|c}{Solve}&
\multicolumn{1}{c|}{Solve} \\

\hline
1 & 27 & 15 & 7 & 81 & 1 &26 & 23 \\ 
2 & 30 & 15 & 8 & 83 & 2 &24 & 21 \\ 
4 & 30 & 12 & 8 & 82 & 2 &24 & 21 \\ 
8 & 29 & 11 & 10 & 81 & 2 &22 & 20 \\ 
16 & 26 & 10 & 14 & 78 & 2 &21 & 17 \\ 
32 & 24 & 9 & 22 & 78 & 2 &18 & 15 \\ 
64 & 20 & 5 & 36 & 78 & 2 &12 & 10 \\ 
%  1 & 27 & 15 & 7 & 81 & 1 &26 & 23 \\ 
%  2 & 30 & 15 & 8 & 83 & 2 &47 & 42 \\ 
%  4 & 30 & 12 & 8 & 82 & 2 &94 & 82 \\ 
%  8 & 29 & 11 & 10 & 81 & 2 &179 & 156 \\ 
%  16 & 26 & 10 & 14 & 78 & 2 &333 & 279 \\ 
%  32 & 24 & 9 & 22 & 78 & 2 &563 & 473 \\ 
%  64 & 20 & 5 & 36 & 78 & 2 &790 & 665 \\ 
\hline
\end{tabular}
\end{center}
\end{table}

The percentage of time spent in the various functions
of GPCG generally decreases slightly as the number of processors increases,
with the exception of the vector reductions.
Since vector reductions require
communication among all processors, they have a significant effect
on the efficiency of the algorithm.
Note that the time for vector reductions
remains fairly constant at about $8\%$ of the total computation
time for \mbox{1--8} processors but that the
efficiency of the algorithm declines
quickly as the percentage of time doing vector reductions
increases to $36\%$ on $64$ processors.
This analysis shows that 
the ratio of computation to communication for this problem is too small
for large number of processors and is responsible for
the loss in scalability of GPCG for $ p > 8 $.

In this discussion of efficiency bear in mind
that the Hessian matrix of the journal bearing problem is relatively
sparse with 5 nonzeros per row on average. 
The efficiency is likely to improve if we deal with
matrices with more nonzeros per row,
since then the amount of computation per conjugate gradient iteration 
increases.
%  {\tt Note From Lois:  Not necessarily -- depends on how you define
%  efficiency, and dense problems don't scale well due to global communication requirements.}{\tt Note From Steve:  How about we simply say that efficiency
%  might increase ... ? }
These problems arise, for example,
in three-dimensional simulations or in variational
problems with vector functions, that is, variational
problems that require determining a vector-valued
$ v : \cD \mapsto \R^m $ for $ m > 1 $ that minimizes the quadratic $q$.


A surprising aspect of the results in Table \ref{routines} is that
the percentage of time
required to extract the submatrix remains nearly constant
at $2\%$ of the total computation time, demonstrating the relative 
efficiency of this phase of the computation.
These results are surprising because at first sight the need to
extract an arbitrary submatrix and to re-balance the distribution of
rows across the processors would destroy the efficiency of the algorithm.
On the other hand, the creation of a second matrix to 
hold the submatrix requires
additional storage.  For large problems the additional storage
may exceed the memory capacity of a small number of processors.

Another important component of our scalability analysis is
the flop rate per processor.
As noted in Table \ref{routines}, the flop rate for the
linear solve component of GPCG is 26 MFlops for
one processor and decreases to about $ 12.3 $ for 64 processors.
For comparison purposes, the flop rate of a Newton algorithm in
PETSc is about 42 MFlops for one processor on a system of nonlinear
equations with the same sparsity as the journal bearing problem.
This rate is higher than the rate achieved by the GPCG algorithm,
but this is to be expected because,  as previously mentioned, 
the GPCG algorithm spends a significant amount of time on tasks
with no arithmetic operations.  The extraction of the submatrix,
creating the reduced linear system and determining the
free variables, typically require more than $ 10\% $ of the time.
Hence, it is unlikely that the GPCG algorithm, or any 
active set algorithm for constrained problems, can
achieve a computation rate as high as a Newton algorithm.

While these computations employed a standard 
compressed, sparse row format for matrix data,
higher flop rates could be obtained on
some problems by changing the matrix format.
Alternative storage
schemes that exploit the structured sparsity of these problems would
achieve higher flop rates for matrix operations by alleviating
unnecessary memory references.  Likewise, block sparse storage
variants for problems with multiple unknowns per grid point would
achieve higher flop rates \cite{gkmt98}.  Since our optimization
algorithms use a data-structure-neutral interface to matrix and vector
operations, we can easily experiment with such alternatives without
altering any of the optimization code.

\section{Preconditioners}

\label{sec:preconditioners}

The ability to experiment with various preconditioners is
a direct result of our design philosophy, which enables
connection to the 
linear algebra infrastructure provided in toolkits such as PETSc.
In particular, we compared the diagonal Jacobi
preconditioner with a block Jacobi preconditioner that used
one block per processor.
We employed sparse matrix based ILU as a subdomain solver for the block
Jacobi method, where we considered both ILU(0), which produced a
factored matrix that maintained the same sparsity pattern as the
subdomain matrix, and ILU(2), which allowed two levels of fill.

The statistics summarized in Table \ref{preconditioners}
are the eccentricity parameter $ \varepsilon$,
the number of processors $p$,
the number of iterations of GPCG,
the time required to solve the problem (in seconds),
and the number of conjugate gradient iterations.
We present results only for $ n = 640,000 $, since similar
results were obtained for $ n = 2,560,000 $.

\begin{table}[htbp]
\caption{Performance of preconditioners in GPCG ($ n = 640,000 $)}
\label{preconditioners}
\begin{center}
\footnotesize
\begin{tabular}{| c r | c c c | c c c | c c c |}
\cline{3-11}
\multicolumn{2}{c}{} &
\multicolumn{3}{|c|}{Diagonal} &
\multicolumn{3}{c|}{Block Jacobi - ILU(0)} &
\multicolumn{3}{c|}{Block Jacobi - ILU(2)} \\
\hline
\multicolumn{1}{|c}{$ \varepsilon $} & 
\multicolumn{1}{c|}{$ p $} & 
\multicolumn{1}{c}{iters} &
\multicolumn{1}{c}{time} & 
\multicolumn{1}{c|}{CG iters} & 
\multicolumn{1}{c}{iters} &
\multicolumn{1}{c}{time} & 
\multicolumn{1}{c|}{CG iters} & 
\multicolumn{1}{c}{iters} &
\multicolumn{1}{c}{time} &
\multicolumn{1}{c|}{CG iters} \\ \hline
 % 0.1 &  1 & 26 & 9438 & 37045 & 26  & 4294 & 7745 & & & \\
 0.1 &  4 & 26 & 2928 & 37045 & 27  & 1324 & 8679  & 26 & 1173 & 6312 \\
 0.1 & 16 & 26 & 851  & 37045 & 27  & 409  & 9105  & 26 & 364 & 6712 \\
\hline                                             
%  0.9 &  1 & 21 & 3909 & 18118 & 23 & 1567 & 2903 & 21 & 1302 & 1569\\
 0.9 &  4 & 21 & 1216 & 18118 & 20 & 416  & 2654 & 20 & 368 & 1864\\
 0.9 & 16 & 22 & 390  & 18118 & 23 & 150  & 3390 & 21 & 128 & 2303\\

% 0.1 &  1 &  27 &  1749 &  2806 & 27 &  3896 &  13539 \\
% 0.1 &  4 &  27 &   572 &  3060 & 27 &  1185 &  13513 \\
% 0.1 & 16 &  27 &   166 &  3225 & 27 &   341 &  13511 \\
% \hline                                                   
% 0.9 &  1 &  24 &   707 &  1205 & 22 &  1694 &  7014 \\
% 0.9 &  4 &  24 &   240 &  1120 & 22 &   557 &  7014 \\
% 0.9 & 16 &  24 &    77 &  1379 & 22 &   158 &  7014 \\
\hline
\end{tabular}
\end{center}
\end{table}

The number of GPCG iterations in Table \ref{preconditioners}
is independent of the number of
processors and of the preconditioner. 
In general we expect small variations in the number of iterations 
because different preconditioners create different approximate solutions
to linear systems and different paths to the solution.

In these experiments we were interested in the impact of
the preconditioner on the total time to solution. The Jacobi method is
scalable, so the main issue is whether the higher computational
cost of the block Jacobi is justified.
As expected, the block Jacobi preconditioner with subdomain solver ILU(2)
required fewer conjugate gradient iterations than subdomain solver ILU(0),
and both block Jacobi preconditioners required fewer
iterations than the point Jacobi method.
In addition, the block Jacobi methods also required less time.
In general, better preconditioners require more time to compute, and
this additional cost sometimes negates the savings achieved from fewer
iterations of the linear solver. 
In this problem, 
the block Jacobi preconditioners used
about half of  the time required by the diagonal preconditioner, and
the additional cost of computing better preconditioners is justified.
The most expensive preconditioner to compute of the three under consideration
in this work, namely, the block Jacobi method with subdomain solver ILU(2),
produced the fewest iterations by
the conjugate gradient method and the smallest overall solution time.

The ability to experiment easily with a variety of preconditioners is
an advantage because we can then choose a technique that is most
suitable to the problem.  In this spirit, we plan to experiment with
the evolving interfaces under development by the Equation Solver
Interface (ESI) (see \url{http://z.ca.sandia.gov/esi}) 
and Common Component Architecture
(CCA) \cite{cca99} working groups, with a goal of
enabling dynamic use within TAO of any ESI-compliant preconditioning
components.

\section{Concluding Remarks}

Our aim is to develop scalable optimization algorithms
for important classes of optimization problems.
For variational problems like the journal bearing problem
described in Section~\ref{sec:performance}, this means that
the computing time grows slowly with increasing $n$
provided the ratio $ n/p $ of variables to processors stays constant.
This case study examined some of the issues that must be
addressed in order to achieve scalability;
further progress will require making use of multigrid
techniques and the relationship between grids.

An important ingredient in the development of scalable
optimization algorithms and software is a flexible interface that
supports a wide variety of data structures and algorithms.
We have shown that the TAO design leverages external
parallel computing infrastructure and
linear algebra toolkits to solve
large-scale optimization problems on high-performance
architectures.
With the exception of the work of 
Biros and Ghattas \cite{GB99b,GB99a},
other codes for large-scale optimization
problems are either custom-written
or restricted to uni-processor environments.

TAO \cite{tao-user-ref}
extends to general nonlinearly bound-constrained optimization,
but the performance issues are more subtle due to the
impact of user-supplied function, gradient and Hessian code.
Extensions of TAO to large
linearly-constrained and nonlinearly-constrained optimization
problems is currently an active research area.

\section*{Acknowledgments}

The development of TAO would not have been possible without the
support and guidance of Satish Balay,
Bill Gropp, and Barry Smith.
They, together with Lois McInnes, are the main developers of PETSc.
